About the project

This is a course about open data, open science and data science. I hope to learn more about how to use R and other tools for statistical analysis.

Click here to access my GitHub repository


Regression and Model Validation

Introducing the Data

library(dplyr)
lrn14 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", sep = ",", header = TRUE)
dim(lrn14)
## [1] 166   7

We are examining a dataset that has information about students’ attitudes towards learning statistics. The dataset has 7 variables and 166 observations. We know the age and gender of each student and how many points they got in the exam. In addition, the questions measuring students’ attitudes toward learning have been divided into three subcategories: deep, strategic and surface learning. Each of these three variables is the average of a student’s answers to the questions in that category on a scale from 1 to 5. Click here to learn more about the variables.

Overview of the Data

Now let us examine the data in more detail. From the graph below we can make the following statements:

  • We can see that we have nearly twice as many female students as male students. We can also see that most of the students in our data are between 20 and 30 years old and that male students are on average slightly older than female students.

  • Male students have on average a slightly better general attitude towards statistics than female students. The average attitude is 3.4 for male students and 3 for female students.

  • There is almost no difference between female and male students in the questions related to different types of learning. The only exception is surface learning, where we can see a clearly smaller variance and a higher mean among female students.

  • The average number of points in the final exam is 22.7. Male students did slightly better, with an average of 23.5.

library(ggplot2)
library(GGally)

ggpairs(lrn14, mapping = aes(col = gender, alpha = 0.3),
            lower = list(combo = wrap("facethist", bins = 20)))

Fitting the Regression Model

We will build a regression model with exam points as the dependent variable. We will choose the three variables most highly correlated with points and use them as the explanatory variables in our model: attitude, stra and surf.

The variable attitude is statistically significant at the 0.1% significance level. The other two variables are not statistically significant on their own. To find out whether a model with fewer variables would be better, we carry out an F-test: we test the null hypothesis that the coefficients of stra and surf are not statistically different from zero against the hypothesis that at least one of them is. The p-value of the F-statistic is 0.18, so we fail to reject the null hypothesis even at the 10% significance level. This means that a model with just attitude as an explanatory variable would be more suitable.

library(sandwich)
library(lmtest)
library(car)

model_1 <- lm(points ~ attitude + stra + surf, data = lrn14)
summary(model_1)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
linearHypothesis(model_1, c("surf = 0", "stra = 0"))  
## Linear hypothesis test
## 
## Hypothesis:
## surf = 0
## stra = 0
## 
## Model 1: restricted model
## Model 2: points ~ attitude + stra + surf
## 
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    164 4641.1                           
## 2    162 4544.4  2    96.743 1.7244 0.1815

Model Interpretation

model_2 <- lm(points ~ attitude, data = lrn14)
summary(model_2)
## 
## Call:
## lm(formula = points ~ attitude, data = lrn14)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Our fitted model is: points = 11.6 + 3.5*attitude

  • The coefficient of the variable attitude is 3.5, which means that a one-point increase in the general attitude towards statistics increases predicted exam points by 3.5 points.

  • The intercept is 11.6, which represents a student’s predicted exam points in the hypothetical situation where his or her attitude towards statistics is zero.

  • The multiple R-squared of our model is 0.19, which means that almost 20% of the variance of the dependent variable is explained by the explanatory variable.

Analysing the Diagnostic Plots

We can examine the normality assumption with the QQ-plot. Our observations fit the line quite nicely except at the tails. Based on this plot we can assume that the errors are normally distributed.

plot(model_2, which= c(2))

Another assumption of the linear regression model is that the size of the errors does not depend on the values of the explanatory variables. We can check whether this assumption holds by examining the Residuals vs Fitted plot. We cannot find a clear pattern in the distribution of the errors, so it is safe to state that this assumption holds.

plot(model_2, which= c(1))

The Residuals vs Leverage plot allows us to study whether any observation has relatively higher leverage than the others. We can easily see that no single observation stands out.

plot(model_2, which= c(5))


Logistic Regression

Introducing the data

##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The data we are going to analyse is about student alcohol consumption and school achievement. It is combined from two larger datasets about student achievement in two different courses (mathematics and Portuguese language) in secondary education at two Portuguese schools. In total, we have 382 observations and 35 variables. These variables consist of student-specific information such as age and sex, as well as information about each student’s family background and living habits. On top of this, we have the students’ grades in periods 1 to 3 (averages of the two courses). We have also created two variables describing how much alcohol a student uses in total (alc_use) and whether he or she is a high user (high_use).
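The two derived variables could be created along these lines. This is a sketch: the toy data frame below stands in for the real joined data, and the cutoff of 2 for high use is an assumption.

```r
# Toy stand-in for the joined data frame (the real `alc` has 382 rows);
# Dalc and Walc are weekday and weekend alcohol use on a 1-5 scale.
alc <- data.frame(Dalc = c(1, 3, 5), Walc = c(2, 4, 5))

# Average of weekday and weekend use, and a logical high-use indicator
alc$alc_use  <- (alc$Dalc + alc$Walc) / 2
alc$high_use <- alc$alc_use > 2   # the cutoff of 2 is an assumption
```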

Alcohol consumption and other variables

Let us explore the relationships between alcohol consumption and four other interesting variables.

Variable   Prediction
Sex        Male students drink more than female students and are more often high users
Failures   The number of failures is higher if alcohol use is high
Absences   The number of absences is higher if alcohol use is high
Romantic   Students who are in a romantic relationship drink less

Graphical overview

  • We have nearly the same number of female and male students, but, as predicted, there are more high alcohol users among male students. X% of male students are high users compared to X% of female students.

  • The mean number of absences is 5.3 in the whole population. The number is clearly higher among students who drink more. High users have nearly two times more absences if they are female and over two times more if they are male. This corresponds nicely with our predictions.

  • The mean number of past failures is 0.3. If we compare the means between high users and non-high users, we can see that the number is about two times higher among high users. Again, this is in line with our predictions.

  • There are more female students in a romantic relationship than male students. Overall, there are more students not in a relationship than students who are, and, as predicted, this is the case among high users as well.

## # A tibble: 4 x 5
## # Groups:   high_use [?]
##   high_use sex   count mean_fail mean_abs
##   <lgl>    <fct> <int>     <dbl>    <dbl>
## 1 F        F       157     0.204     4.97
## 2 F        M       113     0.239     3.19
## 3 T        F        41     0.439     8.90
## 4 T        M        71     0.479     7.41
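The printed tibble suggests dplyr’s group_by/summarise was used; an equivalent base-R sketch on toy data (standing in for the real alc data frame) would be:

```r
# Toy data standing in for `alc`; one student per group for illustration
alc <- data.frame(
  high_use = c(FALSE, FALSE, TRUE, TRUE),
  sex      = c("F", "M", "F", "M"),
  failures = c(0, 1, 0, 2),
  absences = c(4, 2, 9, 8)
)

# Group means of failures and absences by high_use and sex
# (group counts could be added separately with table(alc$high_use, alc$sex))
res <- aggregate(cbind(mean_fail = failures, mean_abs = absences) ~ high_use + sex,
                 data = alc, FUN = mean)
res
```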

The Model

Let us examine the summary of our fitted model:

## 
## Call:
## glm(formula = high_use ~ failures + absences + sex + romantic, 
##     family = "binomial", data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6507  -0.8256  -0.6131   1.0339   2.1292  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.8769     0.2381  -7.882 3.23e-15 ***
## failures      0.4142     0.1511   2.741  0.00613 ** 
## absences      0.0752     0.0182   4.132 3.59e-05 ***
## sexM          0.9757     0.2451   3.980 6.88e-05 ***
## romanticyes  -0.2806     0.2655  -1.057  0.29063    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 417.51  on 377  degrees of freedom
## AIC: 427.51
## 
## Number of Fisher Scoring iterations: 4

We can see that all but one of our coefficients are statistically different from zero. The coefficients of absences and sex are significant even at the 0.1% significance level. The coefficient of failures is significant at the 1% level, and the coefficient of romantic relationship is not significant even at the 10% level. This means that we could leave this variable out of our model.

If we apply the exponential function to our model parameters, we can interpret them as odds ratios. The confidence interval gives the range that would contain the true value of the coefficient in 95% of repeated samples.

##             odds_ratio 2.5 % 97.5 %
## (Intercept)       0.15  0.09   0.24
## failures          1.51  1.12   2.04
## absences          1.08  1.04   1.12
## sexM              2.65  1.65   4.32
## romanticyes       0.76  0.44   1.26
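A table like this can be obtained by exponentiating the coefficients and their confidence intervals. A self-contained sketch on simulated data (the real call would use the fitted model above):

```r
set.seed(1)
# Simulated data standing in for the real student data
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(-1 + 0.8 * d$x))
m <- glm(y ~ x, data = d, family = "binomial")

# Exponentiate coefficients and 95% profile CIs to the odds-ratio scale
OR <- exp(cbind(odds_ratio = coef(m), confint(m)))
round(OR, 2)
```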

  • The intercept can be interpreted as the odds of being a high user for female students who are not in a romantic relationship and have no absences or past failures. As these odds are smaller than one, students in this baseline group are more likely not to be high users.

  • Each additional past failure increases the likelihood of being a high user. That is, a student with 2 past failures is 1.51 times more likely to be a high user than a student with just one failure.

  • Each additional absence increases the likelihood of being a high user. That is, a student with 2 absences is 1.08 times more likely to be a high user than a student with just one absence.

  • Male students are 2.65 times more likely to be high users of alcohol than female students.

  • Students in a romantic relationship are less likely to be high users of alcohol. It is important to note, however, that the confidence interval of this odds ratio includes the number one, so we cannot be sure what kind of impact a romantic relationship actually has.

If we compare these results to our earlier predictions, we can see that they correspond to each other very nicely. The only deviation is that the model suggests that the effect of a romantic relationship might be ambiguous.

Testing the Predictive Power

The model we are going to explore is the following. We dropped the variable romantic from our model as it was not statistically significant.

high_use ~ failures + absences + sex

## 
## Call:
## glm(formula = high_use ~ failures + absences + sex, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6629  -0.8545  -0.5894   1.0339   2.0428  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.95397    0.22819  -8.563  < 2e-16 ***
## failures     0.40462    0.15024   2.693  0.00708 ** 
## absences     0.07294    0.01796   4.061 4.88e-05 ***
## sexM         0.98848    0.24453   4.042 5.29e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 418.64  on 378  degrees of freedom
## AIC: 426.64
## 
## Number of Fisher Scoring iterations: 4

Now we can test the predictive power of our model. As we can see from the table below, our model misclassified 12 non-high users as high users and 86 high users as non-high users. The model succeeded in predicting a high user 26 times. The same results are displayed in the plot labelled “Actual values and the predictions”.

##         prediction
## high_use FALSE TRUE
##    FALSE   258   12
##    TRUE     86   26

## [1] 0.2565445

Now we can compute the average proportion of wrong predictions by defining a loss function. In this case the number is 0.26, which is relatively low.
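The loss function in question compares the observed class to the predicted probability; a sketch (the 0.5 cutoff follows the standard classification rule):

```r
# Average proportion of wrong predictions:
# a prediction is wrong when the probability is on the wrong side of 0.5
loss_func <- function(class, prob) {
  mean(abs(class - prob) > 0.5)
}

loss_func(class = c(1, 0, 0, 1), prob = c(0.9, 0.2, 0.8, 0.4))  # → 0.5
```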

Model Prediction vs. Guessing

We know that nearly one third of the students are high users of alcohol. Now we can compare the performance of our model to a simple strategy that classifies every third person as a high user. We can see that the prediction error is nearly two times higher, so our model is at least better at predicting high users than this simple guessing strategy.

## [1] 0.4319372

Cross-Validation

## [1] 0.2591623

Now let us perform a 10-fold cross-validation on our model. The test set performance is equal to that of the DataCamp exercise model, as the two models have identical predictors. The prediction error is 0.2591623.
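The 10-fold cross-validation can be carried out with cv.glm from the boot package. A self-contained sketch on simulated data (the real call would pass the alc data and the fitted model):

```r
library(boot)  # ships with R as a recommended package

# Loss function: average proportion of wrong predictions
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

set.seed(1)
# Simulated data standing in for the real student data
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(-0.5 + d$x))
m <- glm(y ~ x, data = d, family = "binomial")

# 10-fold cross-validation; delta[1] is the average test-set error
cv <- cv.glm(data = d, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
```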


Clustering and Classification

Introducing the Data

## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

This week we are going to explore crime rates in different parts of Boston. Our dataset has 506 observations and 14 variables. In addition to the crime rate, we know various attributes of the towns. You can find more information about the variables here

Variable Description
crim per capita crime rate by town.
zn proportion of residential land zoned for lots over 25,000 sq.ft.
indus proportion of non-retail business acres per town.
chas Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox nitrogen oxides concentration (parts per 10 million).
rm average number of rooms per dwelling.
age proportion of owner-occupied units built prior to 1940.
dis weighted mean of distances to five Boston employment centres.
rad index of accessibility to radial highways.
tax full-value property-tax rate per $10,000.
ptratio pupil-teacher ratio by town.
black 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat lower status of the population (percent).
medv median value of owner-occupied homes in $1000s.

Graphical Overview

Before starting the analysis, let’s have a look at our data. Below are two plot matrices. From the first one, we can see the distributions of the variables in our dataset. The second one illustrates the correlations between the variables: the bigger the circle, the stronger the correlation, and the colour of the circle tells us whether the correlation is negative (red) or positive (blue).
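The correlation matrix itself needs only base R (the circle plot is typically drawn with the corrplot package); a sketch:

```r
library(MASS)  # the Boston data comes with MASS
data(Boston)

# Pairwise correlations, rounded to two decimals
cor_matrix <- round(cor(Boston), 2)
cor_matrix["crim", "rad"]  # e.g. crime rate vs highway access: 0.63
```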

Standardize the Dataset

After standardizing the dataset we can instantly see that the mean of every variable is equal to zero. In addition, the variances of the variables are all equal to one.
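Standardization subtracts the column mean and divides by the standard deviation; in R this is one call to scale(). A sketch, assuming the Boston data from MASS:

```r
library(MASS)
data(Boston)

# Center and scale every column, keep it as a data frame
boston_scaled <- as.data.frame(scale(Boston))

colMeans(boston_scaled)      # all means are (numerically) zero
apply(boston_scaled, 2, sd)  # all standard deviations are one
```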

##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00

Now we will create a new categorical variable for the crime rate and divide the dataset into two parts: one for training and one for testing.

Linear Discriminant Analysis

Next, we will fit a linear discriminant analysis using our training set. Our dependent variable is the crime rate and the rest of the variables are used as predictors. The plot below illustrates the results. We can see that the high crime rate class is clearly separated from the rest, and it is the variable rad that predicts this separation. From low to medium-high crime rates, it seems that the variables zn and nox are the biggest determinants.

Predictions

Now, we will examine the predictive power of this analysis. If our model had categorised every observation correctly, the table below would have only zeros except on the diagonal. We can see that this is not the case, even though most of the observations are correctly predicted. In addition, we can see that our model succeeded better in categorising the high crime rates than the low ones.
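The whole workflow of this section (recoding crime into quantile-based classes, splitting the data, fitting the LDA and cross-tabulating the predictions) could look roughly like this. It is a sketch: the split is random, so the counts will not match the table below exactly.

```r
library(MASS)

# Standardize and recode the crime rate into four quantile-based classes
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           labels = c("low", "med_low", "med_high", "high"),
                           include.lowest = TRUE)
boston_scaled$crim <- NULL

# 80/20 train/test split
set.seed(123)
ind   <- sample(nrow(boston_scaled), size = floor(0.8 * nrow(boston_scaled)))
train <- boston_scaled[ind, ]
test  <- boston_scaled[-ind, ]

# Fit the LDA with crime as the target, then cross-tabulate the predictions
lda.fit  <- lda(crime ~ ., data = train)
lda.pred <- predict(lda.fit, newdata = test)
table(correct = test$crime, predicted = lda.pred$class)
```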

##           predicted
## correct    low med_low med_high high
##   low       13      11        1    0
##   med_low    4      15        8    0
##   med_high   1      10       12    1
##   high       0       0        1   25

K-means

Lastly, we will run the k-means algorithm on our dataset. We will start by reloading the data, scaling it and calculating the distances between the observations.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618

Now, let’s run the k-means algorithm with 4 clusters. We can decide the optimal number of clusters from the plot below: the optimal number is found at the point where the total within-cluster sum of squares (WCSS) drops radically. In this case, this happens at two clusters.
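The total WCSS for different cluster counts can be computed like this (a sketch; set.seed keeps the random initialization reproducible):

```r
library(MASS)  # for the Boston data
boston_scaled <- scale(Boston)

set.seed(123)
# Total within-cluster sum of squares for k = 1..10 clusters
twcss <- sapply(1:10, function(k) {
  kmeans(boston_scaled, centers = k, iter.max = 100)$tot.withinss
})

plot(1:10, twcss, type = "b",
     xlab = "number of clusters", ylab = "total WCSS")
```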

Let’s run the algorithm again with two clusters and plot the result. We can see that k-means with two clusters works quite nicely: there are usually two distinct distributions visible in each variable.

Super Bonus

Interactive plot:

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type= 'scatter3d', mode='markers', color = train$crime)
## Another plot
km2 <- kmeans(boston_scaled[as.numeric(row.names(train)),], centers = 2)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3,
        type= 'scatter3d', mode='markers', color = km2$cluster)

Dimensionality Reduction Techniques

Graphical Overview

We will begin our analysis by examining the data graphically. We have two plot matrices below. From the first one, we can see the distributions of the variables in our dataset. The second one illustrates the correlations between the variables: the bigger the circle, the stronger the correlation, and the colour of the circle tells us whether the correlation is negative (red) or positive (blue).

From these two plots we can make, for example, the following statements:

  • Countries that have longer life expectancy tend to have higher expected years of education and per capita GNI.
  • Countries that have higher maternal mortality rate have lower life expectancy and more teenage pregnancies

Principal Component Analysis I

First, we will perform principal component analysis (PCA) on the unstandardized human data. If we examine the variability captured by each principal component, seen below, we can see that the first component captures all of it. And if we take a look at the biplot, it doesn’t seem desirable either.

Variability between dimensions
x
PC1 100
PC2 0
PC3 0
PC4 0
PC5 0
PC6 0
PC7 0
PC8 0

FIGURE I: The first PCA dimension seems to capture all of the variation

Principal Component Analysis II

Next, let’s see how the results change if we standardize the data before performing the analysis. The change is remarkable: now some of the variance is captured by the other components as well, and the biplot looks more readable. This change is due to the fact that one of the variables in the data (GNI.pc) has a much higher variance than the others, which distorts the unstandardized analysis.

Variability between dimensions
x
PC1 53.6
PC2 16.2
PC3 9.6
PC4 7.6
PC5 5.5
PC6 3.6
PC7 2.6
PC8 1.3

FIGURE II: The first PCA dimension has variables stretching to both directions. Maternal mortality and teenage pregnancies contribute to the opposite direction than others such as life expectancy. The number of female representatives in the parliament contributes to the second PCA dimension along with labor market ratio.
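The effect of standardization can be reproduced with any dataset where one variable has a much larger variance than the rest; here is a sketch using R’s built-in USArrests data, where Assault plays the role GNI.pc plays above:

```r
pca_raw <- prcomp(USArrests)                 # no standardization
pca_std <- prcomp(USArrests, scale. = TRUE)  # standardize first

# Proportion of variance captured by each principal component, in %
round(100 * summary(pca_raw)$importance["Proportion of Variance", ], 1)
round(100 * summary(pca_std)$importance["Proportion of Variance", ], 1)
```

Without scaling, the first component captures almost everything simply because Assault has the largest raw variance; after scaling, the variance spreads across the components.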

Interpretation

Now, we will analyse the results of our second PCA. If we take a closer look at the biplot, we can see that the variables Rep and Lab.r contribute to the second dimension, while Mat.mort and Ad.birth contribute to the first dimension in the opposite direction from the rest of the variables. These results can also be seen in the summary of the analysis below: the second principal component has two clearly higher loadings, and the first has two clearly positive loadings while the others are negative or close to zero.

Summary of PCA II
PC1 PC2
Edu.r -0.36 0.04
Lab.r 0.05 0.72
Life.exp -0.44 -0.03
Edu.exp -0.43 0.14
GNI.pc -0.35 0.05
Mat.mort 0.44 0.15
Ad.birth 0.41 0.08
Rep -0.08 0.65

Multiple Correspondance Analysis

Now, let us examine a new dataset, tea, from the package FactoMineR. The data relates to a questionnaire about tea consumption. It has 300 observations and 36 variables. The variables include questions about how the tea is consumed and perceived, as well as a few individual attributes. We are not going to analyse the entire dataset; instead we have selected the 9 most interesting variables.
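Selecting the columns and running the MCA could look like this (a sketch; the nine column names below match the str() output that follows):

```r
library(FactoMineR)
data(tea)

# Keep the nine variables of interest
keep_columns <- c("sex", "how", "sugar", "How", "escape.exoticism",
                  "diuretic", "breakfast", "lunch", "sophisticated")
tea_time <- tea[, keep_columns]

# Multiple correspondence analysis without the default plots
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
```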

##  sex                     how           sugar        How     
##  F:178   tea bag           :170   No.sugar:155   alone:195  
##  M:122   tea bag+unpackaged: 94   sugar   :145   lemon: 33  
##          unpackaged        : 36                  milk : 63  
##                                                  other:  9  
##              escape.exoticism         diuretic           breakfast  
##  escape-exoticism    :142     diuretic    :174   breakfast    :144  
##  Not.escape-exoticism:158     Not.diuretic:126   Not.breakfast:156  
##                                                                     
##                                                                     
##        lunch               sophisticated
##  lunch    : 44   Not.sophisticated: 85  
##  Not.lunch:256   sophisticated    :215  
##                                         
## 
## 'data.frame':    300 obs. of  9 variables:
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...

Next, let’s perform the MCA on our data. The summary of the analysis is shown below. The eigenvalues tell us how much of the variance is retained by each dimension: we can see that about half of the variance is retained by the first 5 dimensions. We can also note that none of the variables is highly associated with the first three dimensions. This can be seen from the categorical variables (eta2) section, as none of the values is close to one.

## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.163   0.154   0.129   0.121   0.118   0.116
## % of var.             12.214  11.560   9.647   9.050   8.885   8.694
## Cumulative % of var.  12.214  23.774  33.422  42.471  51.357  60.051
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.106   0.100   0.092   0.087   0.075   0.072
## % of var.              7.958   7.516   6.866   6.536   5.658   5.415
## Cumulative % of var.  68.009  75.525  82.391  88.927  94.585 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  |  0.724  1.072  0.477 | -0.266  0.153  0.064 |  0.050
## 2                  |  0.282  0.162  0.061 |  0.676  0.990  0.350 | -0.580
## 3                  | -0.232  0.110  0.059 |  0.149  0.048  0.025 | -0.088
## 4                  |  0.212  0.092  0.052 | -0.739  1.182  0.630 | -0.230
## 5                  |  0.269  0.149  0.070 |  0.201  0.087  0.039 | -0.392
## 6                  |  0.221  0.100  0.046 | -0.319  0.220  0.095 |  0.074
## 7                  |  0.411  0.346  0.156 | -0.026  0.001  0.001 |  0.034
## 8                  | -0.111  0.025  0.011 |  0.018  0.001  0.000 | -0.196
## 9                  |  0.267  0.146  0.052 |  0.295  0.188  0.064 |  0.167
## 10                 |  0.146  0.044  0.018 |  0.426  0.393  0.155 |  0.275
##                       ctr   cos2  
## 1                   0.007  0.002 |
## 2                   0.873  0.258 |
## 3                   0.020  0.009 |
## 4                   0.137  0.061 |
## 5                   0.399  0.149 |
## 6                   0.014  0.005 |
## 7                   0.003  0.001 |
## 8                   0.099  0.035 |
## 9                   0.072  0.020 |
## 10                  0.195  0.064 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## F                  |  -0.427   7.368   0.266  -8.911 |   0.286   3.497
## M                  |   0.622  10.750   0.266   8.911 |  -0.417   5.102
## tea bag            |   0.233   2.095   0.071   4.603 |  -0.098   0.390
## tea bag+unpackaged |  -0.135   0.387   0.008  -1.573 |   0.550   6.835
## unpackaged         |  -0.748   4.577   0.076  -4.774 |  -0.975   8.225
## No.sugar           |  -0.549  10.634   0.322  -9.819 |   0.409   6.239
## sugar              |   0.587  11.368   0.322   9.819 |  -0.438   6.669
## alone              |  -0.304   4.101   0.172  -7.166 |  -0.235   2.597
## lemon              |   0.447   1.497   0.025   2.715 |  -0.275   0.598
## milk               |   0.789   8.913   0.165   7.032 |   0.741   8.319
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## F                    0.119   5.972 |   0.035   0.061   0.002   0.722 |
## M                    0.119  -5.972 |  -0.050   0.089   0.002  -0.722 |
## tea bag              0.012  -1.932 |  -0.554  15.015   0.401 -10.952 |
## tea bag+unpackaged   0.138   6.426 |   0.829  18.612   0.314   9.686 |
## unpackaged           0.130  -6.226 |   0.450   2.100   0.028   2.874 |
## No.sugar             0.179   7.317 |  -0.025   0.028   0.001  -0.448 |
## sugar                0.179  -7.317 |   0.027   0.030   0.001   0.448 |
## alone                0.103  -5.548 |   0.091   0.470   0.016   2.156 |
## lemon                0.009  -1.670 |   1.037  10.213   0.133   6.303 |
## milk                 0.146   6.609 |  -0.600   6.540   0.096  -5.353 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## sex                | 0.266 0.119 0.002 |
## how                | 0.103 0.214 0.414 |
## sugar              | 0.322 0.179 0.001 |
## How                | 0.222 0.185 0.274 |
## escape.exoticism   | 0.002 0.005 0.148 |
## diuretic           | 0.086 0.220 0.090 |
## breakfast          | 0.119 0.267 0.004 |
## lunch              | 0.028 0.148 0.212 |
## sophisticated      | 0.316 0.049 0.014 |

Finally, we will plot both the variables and the individuals. From the individual plot we can say that there aren’t really any observations that stand out. From the variable plot, on the other hand, we can see that, for example, unpackaged differs from the rest.